An Extension of the CHAID Tree-based Segmentation Algorithm to Multiple Dependent Variables
Authors: Magidson and Vermunt

Forthcoming in: C. Weihs and W. Gaul (Editors). Classification: The Ubiquitous Challenge (2005). Springer, Heidelberg.

Abstract

The CHAID algorithm has proven to be an effective approach for obtaining a quick but meaningful segmentation where segments are defined in terms of demographic or other variables that are predictive of a single categorical criterion (dependent) variable. However, response data may contain ratings or purchase history on several products, or, in discrete choice experiments, preferences among alternatives in each of several choice sets. We propose an efficient hybrid methodology combining features of CHAID and latent class modeling (LCM) to build a classification tree that is predictive of multiple criteria. The resulting method provides an alternative to the standard method of profiling latent classes in LCM through the inclusion of (active) covariates.

1 Background and Summary of Approach

The CHAID (Chi-Squared Automatic Interaction Detection) tree-based segmentation technique has been found to be an effective approach for obtaining meaningful segments that are predictive of a K-category (nominal or ordinal) criterion variable. For example, the dependent variable might be response to a mailing (responders vs. non-responders). Each of the resulting segments, depicted as a terminal node in a tree diagram, is defined as a combination of directly observable categorical predictors such as AGE = 18-24 & INCOME = $80,000+. Descriptive entries in each tree node consist of the sample size and the corresponding observed distribution on the dependent variable (e.g., the associated response rate).

Latent class (LC) models are useful in identifying segments that underlie multiple response variables. While the resulting latent classes can be either ordered (ordinal latent variable) or unordered (nominal latent variable), they are not actionable like CHAID segments, because by definition they are unobservable (latent). In this paper we propose a hybrid methodology that combines the strengths of both approaches. After decomposing a set of M response variables into K underlying latent class segments, a modified CHAID algorithm is used with the K latent classes serving as the K-category nominal (ordinal) criterion variable. The resulting CHAID segments, derived from selected demographic or other exogenous variables that are predictive of the classes, should also tend to be predictive of the M criterion variables.

The hybrid method also provides an alternative to the use of covariates in LCM to profile the classes. In practice, one or more demographic or other exogenous variables are included in an LCM to describe or predict the latent classes using a multinomial logit model. The proposed CHAID-based alternative is especially advantageous when the number of covariates is large, when covariate effects are non-linear, or when there are complicated higher-order interactions.

In the next section we provide brief introductions to the standard CHAID algorithm and the standard LC (cluster and factor) models. We then provide the technical details of the hybrid approach, followed by an empirical example from a pre-post survey (Burns et al., 2001). We conclude with some final remarks.

2 The CHAID algorithm

The original CHAID algorithm was introduced by Kass (1980) for nominal dependent variables. CHAID is a recursive partitioning method useful in exploratory analyses that relate a potentially large number of categorical predictor variables to a single categorical nominal dependent variable. Magidson (1993) extended it to ordinal dependent variables, showing how the extension can exploit fixed scores (such as profitability) for each category of the dependent variable when such scores are known, and how to estimate meaningful scores when they are unknown.
Chi-squared goodness-of-fit tests are used to identify significant predictors and to merge predictor categories that do not differ in their prediction of the dependent variable. Predictor categories are eligible to be merged according to specified scale types. Any categories of nominal ("free") predictors can be merged, while only adjacent categories of ordinal or grouped continuous ("monotonic") predictors are allowed to merge. A final scale type ("float") specifies that the variable is to be treated as monotonic except for the final category, often corresponding to a "don't know" or "missing" response, which is free to merge with any of the other categories. Technical settings include the significance levels associated with merging and splitting, and a stopping rule. A case weight and a frequency variable may also be included in the analysis.

As an example, Figure 1A illustrates a CHAID analysis based on data from a post-election survey of 1,051 persons who voted for either Bush or Gore in the 2000 U.S. election. The dependent variable (VOTE) is the candidate voted for and the predictors are 5 demographic variables: 1) MARSTAT
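The merge eligibility rules and the chi-squared comparison described in this section can be illustrated with a minimal sketch of a single merge step. This is not the exact procedure of Kass (1980): it uses the Pearson statistic on the 2 x K subtable of two predictor categories, the function names (`eligible_pairs`, `chi2_stat`, `chi2_sf`, `best_merge`), the default `alpha_merge` of 0.05, and the counts in `table` are all hypothetical, and a full implementation would repeat the step on the merged groups and adjust the significance levels.

```python
import math
from itertools import combinations


def eligible_pairs(n_cats, scale):
    """Index pairs of predictor categories that the scale type allows to merge."""
    if scale == "free":            # nominal: any two categories may merge
        return list(combinations(range(n_cats), 2))
    if scale == "monotonic":       # ordinal: only adjacent categories may merge
        return [(i, i + 1) for i in range(n_cats - 1)]
    if scale == "float":           # monotonic, except the last category is free
        last = n_cats - 1
        ordered = [(i, i + 1) for i in range(last - 1)]
        return ordered + [(i, last) for i in range(last)]
    raise ValueError(f"unknown scale type: {scale}")


def chi2_stat(row_a, row_b):
    """Pearson chi-squared statistic and degrees of freedom for a 2 x K table."""
    col_totals = [a + b for a, b in zip(row_a, row_b)]
    n = sum(col_totals)
    stat = 0.0
    for row in (row_a, row_b):
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
    dof = sum(1 for c in col_totals if c > 0) - 1
    return stat, dof


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-squared variable (closed-form series)."""
    if dof <= 0 or x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        term = total = 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total, term = 0.0, 1.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail


def best_merge(table, scale, alpha_merge=0.05):
    """Return (p_value, i, j) for the most similar eligible pair, or None."""
    best = None
    for i, j in eligible_pairs(len(table), scale):
        stat, dof = chi2_stat(table[i], table[j])
        p = chi2_sf(stat, dof)
        # Merge candidates are pairs that do NOT differ significantly.
        if p > alpha_merge and (best is None or p > best[0]):
            best = (p, i, j)
    return best


# Hypothetical counts: rows are predictor categories, columns are the two
# categories of the dependent variable. The last row plays the "missing" role.
table = [[40, 60], [45, 55], [70, 30], [40, 60]]
for scale in ("free", "monotonic", "float"):
    print(scale, best_merge(table, scale))
```

With these made-up counts, the first and last categories have identical distributions, so they are the preferred merge under "free" and "float", while "monotonic" can only consider neighbouring categories and so picks the first two.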